Generalized linear models (GLM) as the name implies are a generalization of the linear modeling framework to allow for the modeling of response variables (e.g. soil attributes) with non-normal distributions and heterogeneous variances. Whereas linear models are designed for predicting continuous soil properties such as clay content or soil temperature, GLM can be used to predict the presence/absence of argillic horizons (i.e. logistic regression) or counts of a plant species along a transect (i.e. Poisson regression). These generalizations greatly expand the applicability of the linear modeling framework, while still allowing for a similar fitting procedure and interpretation of the resulting models.
In the past in order to handle non-linearity and heterogeneous variances, transformations have been made to the response variable, such as the log(x). However, such transformations complicate the models interpretation because the results refer to the transformed scale (e.g. log(x)). These response transformations are not guaranteed to achieve both normality and constant variance simultaneously. GLM approaches transform the response, but also preserve the scale of the response, and provide separate functions to transform the mean response and variance, known as the link and variance functions respectively. So instead of looking like this:
\(f(y) = \beta_{0} + \beta_{1}x + \varepsilon\)
you get this:
\(g(\mu)\) or \(\eta = \beta_{0} + \beta_{1}x + \varepsilon\)
with \(g(\mu)\) or \(\eta\) symbolizing the link function.
Another alteration of the classical linear model is that with GLM the coefficients are estimated iteratively by maximum likelihood estimation instead of ordinary least squares. This results in the GLM minimizing the deviance, instead of the sum of squares. However, for the Gaussian (i.e. normal) distributions the deviance and sum of squares are equivalent.
Logistic regression is a specific type of GLM designed to model data that has a binomial distribution (i.e. presence/absence, yes/no, or proportional data), which in statistical learning parlance is considered a classification problem. For binomial data the logit link transform is generally used. The effect of the logit transform can be seen in the following figure. It creates a sigmoidal curve, which enhances the separation between the two groups. It also has the effect of ensuring that the values range between 0 and 1.
When comparing a simple linear model vs a simple logistic model we can see the effect of the logit transform on the relationship between the response and predictor variable. As before it follows a sigmoidal curve and prevents predictions from exceeding 0 and 1.
Example 1: Probability of Mollisols (Beaudette & O’Geen, 2009)
Example 2: Probability of Red Clay (Evans & Hartemink, 2014)
Example 3: Probability of Ponding (NRCS Unpublished)
Now that we’ve discussed some of the basic background GLM theory we’ll move on to a real exercise, and address any additional theory where it relates to specific steps in the modeling process. The examples selected for this chapter come from Joshua Tree National Park (JTNP)(i.e. CA794) in the Mojave desert. The problem tackled here is a familiar one: Where can I expect to find argillic horizons on fan piedmonts? Argillic horizons within the Mojave are typically found on fan remnants, which are a stable landform that is a remnant of the Pleistocene (Peterson, 1981). Despite the low relief of most fans, fan remnants are uplands in the sense that they generally don’t receive run-on or active deposition.
With this dataset we’ll encounter some challenges. To start with, fan piedmont landscapes typically have relatively little relief. Since most of our predictors will be derivatives of elevation, that won’t leave us with much to work with. Also, our elevation data comes from the USGS National Elevation dataset (NED), which provides considerably less detail than say LiDAR or IFSAR data (Shi et al., 2012). Lastly our pedon dataset like most in NASIS, hasn’t received near as much quality control as have the components. So we’ll need to wrangle some of the pedon data before we can analyze it. These are all typical problems encountered in any data analysis and should be good practice. Ideally, it would be more interesting to try and model individual soil series with argillic horizons, but due to some of the challenges previously mentioned it would be difficult with this dataset. However, at the end we’ll look at one simple approach to try and separate individual soil series with argillic horizons.
To start, as always we need to load some extra packages. This will become a familiar routine every time you start R. Most of the basic functions we need to develop a logistic regression model are contained in base R, but the following contain some useful spatial and data manipulation functions. Believe it or not we will use all of them and more.
library(aqp) # specialized soil classes and functions
library(soilDB) # NASIS and SDA import functions
library(raster) # guess
library(rgdal) # spatial import
library(ggplot2) # graphing
library(tidyr) # data manipulation
library(caret) # classification and regression training
library(car) # additional regression tools
Hopefully like all good soil scientists and ecological site specialists you enter your field data into NASIS. Better yet hopefully someone else did it for you! Once data are captured in NASIS it is much easier to import the data into R, extract the pieces you need, manipulate it, model it, etc. If it’s not entered into NASIS, it may as well not exist.
# pedons <- fetchNASIS()
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/ch7_data.Rdata"
load(url(githubURL))
# Examine the makeup of the data we imported from NASIS
str(pedons, max.level = 2)
## Formal class 'SoilProfileCollection' [package "aqp"] with 7 slots
## ..@ idcol : chr "peiid"
## ..@ depthcols : chr [1:2] "hzdept" "hzdepb"
## ..@ metadata :'data.frame': 1 obs. of 1 variable:
## ..@ horizons :'data.frame': 4990 obs. of 43 variables:
## ..@ site :'data.frame': 1168 obs. of 79 variables:
## ..@ sp :Formal class 'SpatialPoints' [package "sp"] with 3 slots
## ..@ diagnostic:'data.frame': 2133 obs. of 4 variables:
Generally before we begin modeling you should spend some time exploring the data. By examining a simple summary we can quickly see the breakdown of how many argillic horizons we have. Unfortunately, odds are good that all the argillic horizons haven’t been consistently populated in the diagnostic horizon table like they should be. Luckily for us, the desert argillic horizons always pop up in the taxonomic name, so we can use pattern matching to extract it. By doing this we gain an additional 11 pedons with argillic horizons and are able to label the missing values (i.e. NA). At a minimum for modeling purposes we probably need 10 pedons of the target we’re interested in and a total of 100 observations overall.
# Check consistency of argillic horizon population
# get the site table
s <- site(pedons)
# tabulate the number of argillic horizons observed
table(s$argillic.horizon, useNA = "ifany")
##
## FALSE TRUE <NA>
## 750 272 146
# or
# summary(s$argillic.horizon)
# Extract argillic presence from the taxonomic subgroup
s$argillic <- grepl("arg", s$tax_subgroup)
table(s$argillic, useNA = "ifany")
##
## FALSE TRUE
## 886 282
Ideally, if the diagnostic horizon table had been populated consistently we could have used the upper depth to diagnostic feature to filter out argillic horizons that start below 50cm, which may not be representative of “good” argillic horizons and may therefore have gotten correlated to a Torripsamments anyway. Not only are unrepresentative sites confusing for scientists, they’re equally confusing for models. However, as we saw earlier, some pedons don’t appear to be fully populated, so we’ll stick with those pedons that have the argillic specified in their taxonomic subgroup name, since it gives us the biggest sample.
d <- diagnostic_hz(pedons)
peiid <- subset(d, diag_kind == "argillic horizon" & featdept < 50, select = peiid)
test <- s$peiid %in% unique(peiid)
summary(test)
## Mode FALSE
## logical 1168
Another obvious place to look is at the geomorphic data in the site table. This information is intended to help differentiate where our soil observations exist on the landscape. If populated consistently it could potentially be used in future disaggregation efforts, as demonstrated by Nauman and Thompson (2014).
# Landform vs argillic presence
# Subset
s_sub <- subset(s, argillic == TRUE)
# Cross tabulate landform vs argillic horizon presence
test <- with(s_sub,
table(landform.string, argillic, useNA = "ifany")
)
# Subset and print landform.string with > 3 observations
test[test > 3,]
## alluvial fan fan apron fan apron & fan remnant
## 9 27 25
## fan remnant hill hillslope
## 109 15 32
## low hill mountain mountain slope
## 5 6 8
## pediment <NA>
## 11 6
# generalize the landform.string
s$landform <- ifelse(grepl("fan|terrace|sheet|drainageway|wash", s$landform.string), "fan", "hill")
Examining the above frequency table we can see that argillic horizons occur predominantly on fan remnants as was alluded too earlier. However, they also seem to occur frequently on other landforms - some of which are curious combinations of landforms or redundant terms.
# Hillslope position
# Subset fan landforms
s_sub <- subset(s, landform == "fan")
# Cross tabulate and calculate proportions, the "2" calculates the proportions relative to the column totals
with(s_sub, round(
prop.table(table(hillslope_pos, argillic, useNA = "ifany"), 2)
* 100)
)
## argillic
## hillslope_pos FALSE TRUE
## Toeslope 18 5
## Footslope 2 2
## Backslope 15 16
## Shoulder 3 4
## Summit 15 32
## <NA> 47 40
# Slope shape
with(s_sub, round(
prop.table(table(paste(shapedown, shapeacross), argillic, useNA = "ifany"), 2)
* 100)
)
## argillic
## FALSE TRUE
## Concave Concave 1 1
## Concave Convex 0 0
## Concave Linear 4 1
## Convex Concave 0 0
## Convex Convex 8 8
## Convex Linear 7 7
## Linear Concave 7 3
## Linear Convex 20 29
## Linear Linear 41 45
## Linear NA 0 0
## NA NA 12 8
Looking at the hillslope position of fan landforms we can see a slightly higher proportion of argillic horizons are found on summits, while less are found on toeslopes. Slope shape doesn’t seem to provide any useful information for distinguishing argillic horizons.
# Surface morphometry, depth and surface rock fragments
# Recalculate gravel
s$surface_gravel <- with(s,
surface_gravel - surface_fgravel
)
# Calculate the total surface rock fragments
s$frags <- apply(s[grepl("surface", names(s))], 1, sum)
# Subset to just look and fans, and select numeric columns
s_sub <- subset(s, landform == "fan", select = c(argillic, bedrckdepth, slope_field, elev_field, frags))
# convert s_sub to wide data format
s_w <- gather(s_sub, key = key, value = value, - argillic)
head(s_w, 2)
## argillic key value
## 1 FALSE bedrckdepth NA
## 2 FALSE bedrckdepth 11
ggplot(s_w, aes(x = argillic, y = value)) +
geom_boxplot() +
facet_wrap(~ key, scale = "free")
Looking at our numeric variables only depth to bedrock seems to show much separation between the presence/absence of argillic horizons.
Next we’ll look at soil scientist bias. The question being: Are some soil scientists more likely to describe argillic horizons than others? Due to the excess number of soil scientist that have worked on CA794, including detailees, we’ve filtered the names of soil scientist to include just the top 3 mappers and given priority to the most senior soil scientists when they occur together.
# Custom function to filter out the top 3 soil scientists
s <- within(s, {
old = describer
describer2 = NA
describer2[grepl("Stephen", old)] = "Stephen" # least senior
describer2[grepl("Paul", old)] = "Paul"
describer2[grepl("Peter", old)] = "Peter" # most senior
})
s_sub <- subset(s, landform == "fan")
# By frequency
with(s_sub, table(describer2, argillic, useNA = "ifany"))
## argillic
## describer2 FALSE TRUE
## Paul 64 28
## Peter 208 86
## Stephen 58 19
## <NA> 134 58
# By proportion
with(s_sub, round(
prop.table(table(describer2, argillic), margin = 1)
* 100)
)
## argillic
## describer2 FALSE TRUE
## Paul 70 30
## Peter 71 29
## Stephen 75 25
For fan landforms, none of the soil scientists seem more likely than the others to describe argillic horizons. However while this information is suggestive, it is far from definitive in showing a potential bias because it doesn’t take into account other factors. We’ll examine this more closely later.
Where do our points plot? We can plot the general location in R, but for this task we will export them to a Shapefile, so we can view them in a proper GIS, and really inspect them. Notice in the figure below the number of points that fall outside the survey boundary. What it doesn’t show is the points that may plot in the Ocean or Mexico!
# Convert soil profile collection to a spatial object
pedons2 <- pedons
slot(pedons2, "site") <- s # this is dangerous, but something needs to be fixed in the site() setter function
idx <- complete.cases(site(pedons2)[c("x", "y")]) # create an index to filter out pedons with missing coordinates
pedons2 <- pedons2[idx]
coordinates(pedons2) <- ~ x + y # set the coordinates
proj4string(pedons2) <- CRS("+init=epsg:4326") # set the projection
pedons_sp <- as(pedons2, "SpatialPointsDataFrame") # coerce to spatial object
pedons_sp <- spTransform(pedons_sp, CRS("+init=epsg:5070")) # reproject
# Read in soil survey area boundaries
# ssa <- readOGR(dsn = "F:/geodata/soils/soilsa_a_nrcs.shp", layer = "soilsa_a_nrcs")
# ca794 <- subset(ssa, areasymbol == "CA794") # subset out Joshua Tree National Park
# ca794 <- spTransform(ca794, CRS("+init=epsg:5070"))
# Plot
plot(ca794, axes = TRUE)
plot(pedons_sp, add = TRUE) # notice the points outside the boundary
# Write shapefile of pedons
writeOGR(pedons_sp, dsn = "C:/workspace2", "pedons_sp", driver = "ESRI Shapefile", overwrite_layer = TRUE)
Prior to any spatial analysis or modeling, you will need to develop a suite of geodata files that can be intersected with your field data locations. This is, in and of itself a difficult task, and should be facilitated by your Regional GIS Specialist. Typically, these geodata files would primarily consist of derivatives from a DEM or satellite imagery. Prior to any prediction it is also necessary to ensure the geodata files have the same projection, extent, and cell size. Once we have the necessary files we can construct a list in R of the file names and paths, read the geodata into R, and then extract the geodata values where they intersect with field data.
# set file path
folder <- "D:/geodata/project_data/R8-VIC/ca794/"
# list of file names
files <- c(
elev = "ned30m_8VIC.tif", # elevation
slope = "ned30m_8VIC_slope5.tif", # slope gradient
aspect = "ned30m_8VIC_aspect5.tif", # slope aspect
twi = "ned30m_8VIC_wetness.tif", # topographic wetness index
twi_sc = "ned30m_8VIC_wetness_sc.tif", # transformed twi
ch = "ned30m_8VIC_cheight.tif", # catchment height
z2str = "ned30m_8VIC_z2stream.tif", # height above streams
mrrtf = "ned30m_8VIC_mrrtf.tif", # multiresolution ridgetop flatness index
mrvbf = "ned30m_8VIC_mrvbf.tif", # multiresolution valley bottom flatness index
solar = "ned30m_8VIC_solar.tif", # solar radiation
precip = "prism30m_8VIC_ppt_1981_2010_annual_mm.tif", # annual precipitation
precipsum = "prism30m_8VIC_ppt_1981_2010_summer_mm.tif", # summer precipitation
temp = "prism30m_8VIC_tmean_1981_2010_annual_C.tif", # annual temperature
ls = "landsat30m_8VIC_b123457.tif", # landsat bands
pc = "landsat30m_8VIC_pc123456.tif", # principal components of landsat
tc = "landsat30m_8VIC_tc123.tif", # tasseled cap components of landsat
k = "gamma30m_8VIC_namrad_k.tif", # gamma radiometrics signatures
th = "gamma30m_8VIC_namrad_th.tif",
u = "gamma30m_8VIC_namrad_u.tif",
cluster = "cluster152.tif" # unsupervised classification
)
# combine the folder directory and file names
geodata_f <- paste0(folder, files)
names(geodata_f) <- names(files)
# Create a raster stack
geodata_r <- stack(geodata_f)
# Extract the geodata and add to a data frame
data <- raster::extract(geodata_r, pedons_sp, sp = TRUE)@data
# Modify some of the geodata variables
idx <- aggregate(mast ~ cluster, data = data, mean, na.rm = TRUE)
names(idx)[2] <- "cluster_mast"
data <- merge(data, idx, by = "cluster", all.x = TRUE)
data <- within(data, {
mast = temp - 4
cluster = factor(cluster, levels = 1:15)
cluster2 = reorder(cluster, cluster_mast)
gsi = (ls_3 - ls_1) / (ls_3 + ls_2 + ls_1)
ndvi = (ls_4 - ls_3) / (ls_4 + ls_3)
sw = cos(aspect - 255)
twi_sc = abs(twi - 13.8) # 13.8 = twi median
})
# save(data, ca794, pedons, file = "C:/workspace2/ch7_data.Rdata")
# Strip out location and personal information before uploading to the internet
# s[c("describer", "describer2", "x", "y", "x_std", "y_std", "utmnorthing", "utmeasting", "classifier")] <- NA
# slot(pedons, "site") <- s
# data[c("describer2", "x_std", "y_std")] <- NA
# save(data, ca794, pedons, file = "C:/workspace2/stats_for_soil_survey/trunk/data/ch7_data.Rdata")
With our spatial data in hand, we can now see whether any of the variables will help us separate the presence/absence of argillic horizons. Because we’re dealing with a classification problem, we’ll compare the numeric variables using boxplots. What we’re looking for are variables with the least amount of overlap in their distribution (i.e. the greatest separation in their median values).
# Load data
load(file = "C:/workspace2/github/stats_for_soil_survey/trunk/data/ch7_data.Rdata")
train <- data
# Select argillic horizons with "arg" in the subgroup name and on fans
# Argillic horizons that occur on hills and mountains more than likely form by different process, and therefore would require a different model.train$argillic
train$argillic <- ifelse(grepl("arg", train$tax_subgroup) &
train$mrvbf > 0.15,
TRUE, FALSE
)
train <- subset(train, !is.na(argillic), select = - c(pedon_id, taxonname, x_std, y_std, landform.string, cluster, cluster_mast, argillic.horizon, tax_subgroup, frags))
train2 <- subset(train, select = - c(describer2, landform, cluster2))
data_m <- gather(train2, key = key, value = value, - argillic)
ggplot(data_m, aes(x = argillic, y = value)) +
geom_boxplot() +
facet_wrap(~ key, scales = "free")
Modeling is an iterative process that cycles between fitting and evaluating alternative models. Compared to tree and forest models, linear and generalized models require more input from the user. Automated model selection procedures are available, but are typically discouraged because they generally result in complex and unstable models. This is in part due to correlation amongst the predictive variables that can confuse the model selection process. In addition, the order is which the variables are included or excluded from the model effects the significance of the others, and thus several weak predictors might mask the effect of one strong predictor. Therefore, it is best to begin with a selection of predictors that are known to be useful, and grow the model incrementally.
The example below is known as a ‘forward selection’ procedure, where a full model is fit and compared against a null model, to assess the importance (or contribution) of the different predictors. For testing alternative models the Akaike’s Information Criterion (AIC) is used. When using AIC to assess predictor significance, a smaller number is better.
# Fit full model
full <- glm(argillic ~ ., data = train, family = binomial) # "~ ." includes all columns in the data set
# Fit null model,
null <- glm(argillic ~ 1, data = train, family = binomial) # "~ 1" just includes an intercept
# Compute AIC
add1(null, full)
## Single term additions
##
## Model:
## argillic ~ 1
## Df Deviance AIC
## <none> 872.18 891.63
## describer2 3 853.25 878.70
## landform 1 768.81 790.26
## elev 1 869.79 891.24
## slope 1 704.62 726.07
## aspect 1 869.85 891.30
## twi 1 837.20 858.65
## twi_sc 1 690.72 712.17
## ch 1 870.81 892.26
## z2str 1 805.75 827.20
## mrrtf 1 823.82 845.27
## mrvbf 1 819.41 840.86
## solar 1 861.83 883.28
## precip 1 864.20 885.65
## precipsum 1 854.28 875.73
## temp 1 870.08 891.53
## ls_1 1 871.93 893.38
## ls_2 1 869.79 891.24
## ls_3 1 864.88 886.33
## ls_4 1 860.60 882.05
## ls_5 1 846.15 867.60
## ls_6 1 847.19 868.64
## pc_1 1 855.78 877.23
## pc_2 1 836.95 858.40
## pc_3 1 858.12 879.57
## pc_4 1 869.35 890.80
## pc_5 1 872.17 893.62
## pc_6 1 863.85 885.30
## tc_1 1 861.58 883.03
## tc_2 1 869.32 890.77
## tc_3 1 836.92 858.37
## k 1 870.13 891.58
## th 1 871.26 892.71
## u 1 872.02 893.47
## mast 1 870.08 891.53
## cluster2 12 760.01 803.46
## gsi 1 835.26 856.71
## ndvi 1 868.77 890.22
## sw 1 872.16 893.61
According to the add1() function, we can see that twi_sc gives us a model with the smallest AIC and residual deviance. This confirms what the exploratory analysis with the boxplots suggested to us earlier. So let’s add twi_sc to the null model using the update() function. Then continue using the add1() or drop1() functions, until the model is saturated.
# add twi_sc to the model, "-" will subtract predictors
argi.glm <- update(null, . ~ . + twi_sc)
# or refit
argi.glm <- glm(argillic ~ twi_sc, data = train, family = binomial)
# iterate until the model is saturated
add1(argi.glm, full)
If we keep iterating we might select a model similar to the one below.
argi.glm <- glm(argillic ~ twi_sc + slope + ls_1 + ch + z2str + mrvbf, data = train, family = binomial)
However, in addition to adding variables via ‘forward selection’, we can also test the effect of dropping variables via ‘backward selection’ with the `drop1()’ function, and examine their significance.
# Compute AIC
drop1(argi.glm, test = "Chisq")
## Single term deletions
##
## Model:
## argillic ~ twi_sc + slope + ls_1 + ch + z2str + mrvbf
## Df Deviance AIC LRT Pr(>Chi)
## <none> 600.54 614.54
## twi_sc 1 655.50 667.50 54.961 1.230e-13 ***
## slope 1 640.26 652.26 39.720 2.932e-10 ***
## ls_1 1 607.48 619.48 6.937 0.0084418 **
## ch 1 607.30 619.30 6.761 0.0093176 **
## z2str 1 615.05 627.05 14.511 0.0001393 ***
## mrvbf 1 605.26 617.26 4.719 0.0298311 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It appears all the variables exceed the typical significance of 0.05.
After the model is saturated and all the variables are significant, you should end up with a model similar to the one below.
# Examine the effect and error for each predictors
summary(argi.glm)
##
## Call:
## glm(formula = argillic ~ twi_sc + slope + ls_1 + ch + z2str +
## mrvbf, family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.52678 -0.56621 -0.11136 -0.00255 2.87734
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.951983 0.810486 4.876 1.08e-06 ***
## twi_sc -0.630857 0.097191 -6.491 8.53e-11 ***
## slope -0.248590 0.053932 -4.609 4.04e-06 ***
## ls_1 -0.024334 0.009363 -2.599 0.00935 **
## ch -0.002913 0.001175 -2.479 0.01318 *
## z2str -0.020934 0.006033 -3.470 0.00052 ***
## mrvbf -0.285661 0.133716 -2.136 0.03265 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 875.83 on 989 degrees of freedom
## Residual deviance: 600.54 on 983 degrees of freedom
## (40 observations deleted due to missingness)
## AIC: 614.54
##
## Number of Fisher Scoring iterations: 8
# Convert the coefficients to an odds scale, who here gambles?
exp(coef(argi.glm))
## (Intercept) twi_sc slope ls_1 ch z2str
## 52.0384651 0.5321354 0.7798994 0.9759592 0.9970909 0.9792839
## mrvbf
## 0.7515171
# Importance of each predictor assessed by the amount of deviance they explain
anova(argi.glm)
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: argillic
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev
## NULL 989 875.83
## twi_sc 1 182.883 988 692.95
## slope 1 54.734 987 638.22
## ls_1 1 18.532 986 619.68
## ch 1 3.815 985 615.87
## z2str 1 10.608 984 605.26
## mrvbf 1 4.719 983 600.54
After we’re satisfied no additional variables will improve the fit, we need to evaluate it’s residuals, collinearity, accuracy, and model coefficients.
# Residual Plots for GLM
par(mfrow = c(2, 2))
plot(argi.glm)
residualPlots(argi.glm, fit = FALSE, tests = FALSE)
The variance inflation factor (VIF) is used to assess collinearity amongst the predictors. Its square root indicates the amount of increase in the predictor coefficients standard error. A value greater than 2 indicates a doubling the standard error. Rules of thumb vary, but a square root of vif greater than 2 or 3 indicates an unacceptable value.
# Variance inflation, greater than 5 or 10 is bad
vif(argi.glm)
## twi_sc slope ls_1 ch z2str mrvbf
## 1.100552 1.684762 1.170740 1.163639 1.373601 1.879715
Because we’re dealing with a classification problem, we have to consider both errors of commission (Type I) and omission (Type II), or their corresponding accuracies of sensitivity (producer’s accuracy) and positive predicted value (user’s accuracy or precision) respectively. Before we can assess the error, however, we need to select a probability threshold.
# examine possible thresholds
train$predict <- predict(argi.glm, train, type = "response")
ggplot(train, aes(x = predict, fill = argillic)) +
geom_density(alpha = 0.5) +
geom_vline(aes(xintercept = 0.5), lty = "dashed") +
xlab("probability") +
scale_x_continuous(breaks = seq(0, 1, 0.2))
train$predict <- train$predict > 0.35
# Confusion Matrix
cm <- table(predicted = train$predict, observed = train$argillic)
confusionMatrix(cm, positive = "TRUE")
## Confusion Matrix and Statistics
##
## observed
## predicted FALSE TRUE
## FALSE 733 67
## TRUE 97 93
##
## Accuracy : 0.8343
## 95% CI : (0.8097, 0.857)
## No Information Rate : 0.8384
## P-Value [Acc > NIR] : 0.65423
##
## Kappa : 0.4317
## Mcnemar's Test P-Value : 0.02354
##
## Sensitivity : 0.58125
## Specificity : 0.88313
## Pos Pred Value : 0.48947
## Neg Pred Value : 0.91625
## Prevalence : 0.16162
## Detection Rate : 0.09394
## Detection Prevalence : 0.19192
## Balanced Accuracy : 0.73219
##
## 'Positive' Class : TRUE
##
# Deviance squared
library(modEvA)
Dsquared(argi.glm)
## [1] 0.314319
# Adjusted deviance squared
Dsquared(argi.glm, adjust = TRUE)
## [1] 0.3101338
library(dplyr)
temp <- subset(train, argillic == TRUE) %>%
group_by(cluster2) %>%
summarize(
sum_arg = sum(argillic, na.rm = TRUE),
sum_pred = sum(predict, na.rm = TRUE),
sensitivity = round(sum(predict == argillic) / length(argillic), 2)
)
ggplot(temp, aes(x = cluster2, y = sensitivity)) +
geom_point()
# Remove outlier clusters
train_sub <- subset(train, ! cluster2 %in% c(12, 7))
# full <- glm(argillic ~ ., data = train_sub, family = binomial)
# null <- glm(argillic ~ 1, data = train_sub, family = binomial)
# add1(null, full, train = "Chisq")
sub.glm <- glm(argillic ~ slope + twi_sc + ls_1 + mrvbf + z2str + ch, data = train_sub, family = binomial)
# summary(sub.glm)
train_sub$predict <- predict(sub.glm, train_sub, type = "response") > 0.35
cm <- table(predicted = train_sub$predict, observed = train_sub$argillic)
confusionMatrix(cm, positive = "TRUE")
## Confusion Matrix and Statistics
##
## observed
## predicted FALSE TRUE
## FALSE 622 52
## TRUE 81 90
##
## Accuracy : 0.8426
## 95% CI : (0.8163, 0.8665)
## No Information Rate : 0.832
## P-Value [Acc > NIR] : 0.21826
##
## Kappa : 0.4795
## Mcnemar's Test P-Value : 0.01519
##
## Sensitivity : 0.6338
## Specificity : 0.8848
## Pos Pred Value : 0.5263
## Neg Pred Value : 0.9228
## Prevalence : 0.1680
## Detection Rate : 0.1065
## Detection Prevalence : 0.2024
## Balanced Accuracy : 0.7593
##
## 'Positive' Class : TRUE
##
temp <- subset(train_sub, argillic == TRUE) %>%
group_by(cluster2) %>%
summarize(
sum_arg = sum(argillic, na.rm = TRUE),
sum_pred = sum(predict, na.rm = TRUE),
sensitivity = round(sum(predict == argillic) / length(argillic), 2)
)
ggplot(temp, aes(x = cluster2, y = sensitivity)) +
geom_point()
# Custom function to return the predictions and their standard errors
predfun <- function(model, data) {
v <- predict(model, data, type = "response", se.fit = TRUE)
cbind(
p = as.vector(v$fit),
se = as.vector(v$se.fit)
)
}
# Generate spatial predictions
r <- predict(geodata_r, argi.glm, fun = predfun, index = 1:2, progress = "text")
# Export the results
writeRaster(r[[1]], "argi.tif", overwrite = T, progress = "text")
writeRaster(r[[2]], "argi_se.tif", overwrite = T, progress = "text")
plot(raster("C:/workspace2/argi.tif"))
plot(ca794, add = TRUE)
plot(raster("C:/workspace2/argi_se.tif"))
plot(ca794, add = TRUE)
# Download clipped example from Pinto Basin Joshua Tree
githubURL <- "https://raw.githubusercontent.com/ncss-tech/stats_for_soil_survey/master/data/logistic/argi_pb.zip"
download.file(githubURL, destfile = "C:/workspace2/argi_pb.zip")
unzip(zipfile="C:/workspace2/argi_pb.zip", exdir="C:/workspace2")
Beaudette, D. E., & O’Geen, A. T, 2009. Quantifying the aspect effect: an application of solar radiation modeling for soil survey. Soil Science Society of America Journal, 73:1345-1352
Gessler, P. E., Moore, I. D., McKenzie, N. J., & Ryan, P. J, 1995. Soil-landscape modelling and spatial prediction of soil attributes. International Journal of Geographical Information Systems, 9:421-432
Gorsevski, P. V., Gessler, P. E., Foltz, R. B., & Elliot, W. J, 2006. Spatial prediction of landslide hazard using logistic regression and ROC analysis. Transactions in GIS, 10:395-415
Evans, D.M. and Hartemink, A.E., 2014. Digital soil mapping of a red clay subsoil covered by loess. Geoderma, 230:296-304.
Hosmer Jr, D.W., Lemeshow, S. and Sturdivant, R.X., 2013. Applied logistic regression (Vol. 398). John Wiley & Sons
Kempen, B., Brus, D. J., Heuvelink, G., & Stoorvogel, J. J. (2009). Updating the 1: 50,000 Dutch soil map using legacy soil data: A multinomial logistic regression approach. Geoderma, 151:311-326.
Nauman, T. W., and J. A. Thompson, 2014. Semi-automated disaggregation of conventional soil maps using knowledge driven data mining and classification trees. Geoderma 213:385-399. http://www.sciencedirect.com/science/article/pii/S0016706113003066
Peterson, F.F., 1981. Landforms of the basin and range province: defined for soil survey. Nevada Agricultural Experiment Station Technical Bulletin 28, University of Nevada - Reno, NV. 52 p. http://jornada.nmsu.edu/files/Peterson_LandformsBasinRangeProvince.pdf
Shi, X., L. Girod, R. Long, R. DeKett, J. Philippe, and T. Burke, 2012. A comparison of LiDAR-based DEMs and USGS-sourced DEMs in terrain analysis for knowledge-based digital soil mapping. Geoderma 170:217-226. http://www.sciencedirect.com/science/article/pii/S0016706111003387
Lane, P.W., 2002. Generalized linear models in soil science. European Journal of Soil Science 53, 241- 251. http://onlinelibrary.wiley.com/doi/10.1046/j.1365-2389.2002.00440.x/abstract
James, G., D. Witten, T. Hastie, and R. Tibshirani, 2014. An Introduction to Statistical Learning: with Applications in R. Springer, New York. http://www-bcf.usc.edu/~gareth/ISL/
Hengl, T. 2009. A Practical Guide to Geostatistical Mapping, 2nd Edt. University of Amsterdam, www.lulu.com, 291 p. ISBN 978-90-9024981-0. http://spatial-analyst.net/book/system/files/Hengl_2009_GEOSTATe2c0w.pdf